class: center, middle, inverse, title-slide # A workflow for open and reproducible MRI studies ## Discussion series on “Human Research Data in Practice” ### Lennart Wittkuhn (
wittkuhn@mpib-berlin.mpg.de
) ### Max Planck Research Group NeuroCode
Max Planck Institute for Human Development, Berlin, Germany ### 22nd of June 2021 --- # About #### About me - PhD student at the [Max Planck Research Group "NeuroCode"](https://www.mpib-berlin.mpg.de/research/research-groups/mprg-neurocode) at the [Max Planck Institute for Human Development](https://www.mpib-berlin.mpg.de/en) in Berlin - Research: I study the role of fast neural reactivation ("replay") in decision-making in humans using fMRI - Member of the MPIB's working group on research data management - You can contact me via [email](mailto:wittkuhn@mpib-berlin.mpg.de), [Twitter](https://twitter.com/lnnrtwttkhn), [GitHub](https://github.com/lnnrtwttkhn) or [LinkedIn](https://www.linkedin.com/in/lennart-wittkuhn-6a079a1a8/) -- #### About this presentation - **Slides:** Slides are publicly available via https://lennartwittkuhn.com/talk-rdm/ - **Source:** Source code is publicly available on GitHub: https://github.com/lnnrtwttkhn/talk-rdm/ - **Links:** This presentation contains links to external resources. I do not take responsibility for the accuracy, legality or content of the external sites or for that of subsequent links. If you notice an issue with a link, please contact me! - **Notes:** Collaborative notes during the talk via [HedgeDoc](https://pad.gwdg.de/F9bVf_flR82RczuISMQnwg#) (publicly available!) - **Contact**: I am happy about any feedback or suggestions via [email](mailto:wittkuhn@mpib-berlin.mpg.de), [HedgeDoc](https://pad.gwdg.de/F9bVf_flR82RczuISMQnwg#) or [GitHub issues](https://github.com/lnnrtwttkhn/talk-rdm/issues). Thank you! 🙏 --- # Agenda 1. **Introduction** 2. **Workflow** - Data Management - Code Management - Computational Environments 3. **Discussion** --- class: title-slide, center, middle name: introduction # Introduction <!--the next --- is a horizontal line--> --- --- # The philosophy .pull-left[ *"An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship.
The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures."* Buckheit & Donoho (1995), paraphrasing Jon Claerbout ] .pull-right[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://reproducibility.org/blog/wp-content/uploads/2015/08/claerbout1.jpg" alt="Jon Claerbout <br>Geophysicist at Stanford University" width="50%" /> <p class="caption">Jon Claerbout <br>Geophysicist at Stanford University</p> </div> ] .pull-left[ *"Open Science should just be science"* ] ??? - Jon Claerbout, a distinguished exploration geophysicist at Stanford University - He has also pointed out that we have reached a point where solutions are available - it is now possible to publish computational research that is really reproducible by others. --- # The challenge <img src="data:image/png;base64,#https://keeper.mpdl.mpg.de/f/ead22cde6d724eda81d2/?dl=1" width="50%" style="display: block; margin: auto;" /> ??? - Data is produced through code (e.g., task code) - Data is manipulated by code and new data is generated - Mapping between input and output data - This happens using specific software in specific versions --- # "Practice" of research code and data management - "*Where is the data?*" -- - "*Can I see your code?*" -- - "*Which version of the code and data did you use to produce this result?*" -- - "*What is the difference between `data_version1_edit.csv` and `data_version8_new_final.csv`?*" -- - "*I get different results on my machine ...*" -- - "*But it worked when I ran it last month?!*" --- # The solution - Code, data and computational environments change all the time! 
- Example: Running the same analysis on your laptop, the cluster, or your collaborator's computer - → We need **version control** --- # Our paper <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://keeper.mpdl.mpg.de/f/ea0795d894e44fd3ad18/?dl=1" alt="<a href="https://doi.org/10.1038/s41467-021-21970-2" target="_blank">doi: 10.1038/s41467-021-21970-2</a> (accessed 17/06/21)" width="75%" /> <p class="caption"><a href="https://doi.org/10.1038/s41467-021-21970-2" target="_blank">doi: 10.1038/s41467-021-21970-2</a> (accessed 17/06/21)</p> </div> -- #### Two-sentence summary: > Non-invasive measurement of fast neural activity with spatial precision in humans is difficult. Here, the authors show how fMRI can be used to detect sub-second neural sequences in a localized fashion and report fast replay of images in visual cortex that occurred independently of the hippocampus. --- class: title-slide, center, middle name: workflow-data # Workflow: Data Management <!--the next --- is a horizontal line--> --- --- # Data management using DataLad: Overview #### From Wittkuhn & Schuck, 2021, *Nature Communications* (see [Data Availability statement](https://www.nature.com/articles/s41467-021-21970-2#data-availability)): > *"We publicly share all data used in this study. Data and code management was realized using DataLad [version 0.13.0, for details, see https://www.datalad.org/].*" -- - All individual datasets can be found at: https://gin.g-node.org/lnnrtwttkhn - Each dataset is associated with a unique URL and a Digital Object Identifier (DOI) - Dataset structure shared to GitHub and dataset contents shared to GIN -- #### All data?
-- - MRI and behavioral data adhering to the [BIDS standard](https://bids.neuroimaging.io/) ([GitHub](https://github.com/lnnrtwttkhn/highspeed-bids), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-bids), [DOI](https://doi.org/10.12751/g-node.4ivuv8)) - MRI quality metrics and reports based on [MRIQC](https://mriqc.readthedocs.io/en/stable/) ([GitHub](https://github.com/lnnrtwttkhn/highspeed-mriqc), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-mriqc), [DOI](https://doi.org/10.12751/g-node.0vmyuh)) - preprocessed MRI data using [fMRIPrep](https://fmriprep.org/en/stable/) ([GitHub](https://github.com/lnnrtwttkhn/highspeed-fmriprep), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-fmriprep), [DOI](https://doi.org/10.12751/g-node.0ft06t)) - binarized anatomical masks used for feature selection ([GitHub](https://github.com/lnnrtwttkhn/highspeed-masks), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-masks), [DOI](https://doi.org/10.12751/g-node.omirok)) - first-level GLM results used for feature selection ([GitHub](https://github.com/lnnrtwttkhn/highspeed-glm), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-glm), [DOI](https://doi.org/10.12751/g-node.d21zpv)) - results of the multivariate decoding approach ([GitHub](https://github.com/lnnrtwttkhn/highspeed-decoding), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-decoding), [DOI](https://doi.org/10.12751/g-node.9zft1r)) - unprocessed data of the behavioral task acquired during MRI acquisition ([GitHub](https://github.com/lnnrtwttkhn/highspeed-data-behavior), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-data-behavior), [DOI](https://doi.org/10.12751/g-node.p7dabb)) \> 1.5 TB in total, version-controlled using DataLad --- # Data organization: Relying on community standards #### Organization of "raw data" according to [Brain Imaging Data Structure (BIDS)](https://bids.neuroimaging.io/) <div class="figure" style="text-align: center"> <img
src="data:image/png;base64,#https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fsdata.2016.44/MediaObjects/41597_2016_Article_BFsdata201644_Fig1_HTML.jpg?as=webp" alt="see Gorgolewski et al., 2016, <i>Nature Scientific Data</i></br><a href="https://doi.org/10.1038/sdata.2016.44" target="_blank">doi: 10.1038/sdata.2016.44</a>" width="60%" /> <p class="caption">see Gorgolewski et al., 2016, <i>Nature Scientific Data</i></br><a href="https://doi.org/10.1038/sdata.2016.44" target="_blank">doi: 10.1038/sdata.2016.44</a></p> </div> --- # Data sharing via GIN <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://gin.g-node.org/img/favicon.png" alt="<a href="https://gin.g-node.org/" target="_blank">https://gin.g-node.org/</a>" width="10%" /> <p class="caption"><a href="https://gin.g-node.org/" target="_blank">https://gin.g-node.org/</a></p> </div> > "*GIN is [...] a web-accessible repository store of your data based on git and git-annex that you can access securely anywhere you desire while keeping your data in sync, backed up and easily accessible [...]"* -- #### Advantages of GIN (non-exhaustive list) - free and open-source (could be hosted within MPIs / MPG) - supports private and public repositories - publicly funded by the Federal Ministry of Education and Research (BMBF) - servers are located in Germany (near Munich) - provides Digital Object Identifiers (DOIs) (details [here](https://gin.g-node.org/G-Node/Info/wiki/DOI)) - allows you to set your own license (details [here](https://gin.g-node.org/G-Node/Info/wiki/Licensing)) - since both build on git + git-annex, DataLad works perfectly with GIN (details [here](https://handbook.datalad.org/en/latest/basics/101-139-gin.html)) --- # Publishing a DataLad dataset to GIN in only 4 steps Step 1: Create a dataset ```bash datalad create my_dataset ``` -- Step 2: Save data into the dataset ```bash datalad save -m "add data to dataset" ``` -- Step 3: Add the GIN remote ("sibling")
```bash datalad siblings add -d . --name gin --url git@gin.g-node.org:/my_username/my_dataset.git ``` -- Step 4: Transfer the dataset to GIN ```bash datalad push --to gin ``` -- Done! 🎉 --- # Sharing data via third-party infrastructure <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/publishing_network_publishparts2.svg" alt="see <a href="http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html" target="_blank">DataLad Handbook: Third-party infrastructure</a>" width="45%" /> <p class="caption">see <a href="http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html" target="_blank">DataLad Handbook: Third-party infrastructure</a></p> </div> -- #### Suggested alternatives to GIN that can be used with DataLad: - [Keeper](https://keeper.mpdl.mpg.de/) (Seafile) offers all Max Planck employees 1TB(!) of storage (expandable) - [Open Science Framework (OSF)](https://osf.io/), popular in Psychology / Neuroscience (see [details](http://docs.datalad.org/projects/osf/en/latest/)) --- # What is DataLad? #### What is DataLad (see [10,000 feet overview](http://handbook.datalad.org/en/latest/intro/executive_summary.html))? - **"Git for data"** - Free, [open-source](https://github.com/datalad/datalad) **command-line tool** - Building on top of **Git** and **git-annex**, DataLad allows you to **version control arbitrarily large files** in datasets. - *"Arbitrarily large?"* - yes, see DataLad dataset of 80TB / 15 million files from the Human Connectome Project (see [details](https://handbook.datalad.org/en/latest/usecases/HCP_dataset.html#usecase-hcp-dataset)) - DataLad doesn't "lock you in" -- - I'm "only" an enthusiastic user of DataLad!
😊 --- class: title-slide, center, middle name: workflow-code # Workflow: Code Management <!--the next --- is a horizontal line--> --- --- # Code sharing using Git and DataLad #### From Wittkuhn & Schuck, 2021, *Nature Communications* (see [Code Availability statement](https://www.nature.com/articles/s41467-021-21970-2#code-availability)): > "*We share all code used in this study. An overview of all the resources is publicly available on our project website: https://wittkuhn.mpib.berlin/highspeed/.*" - code for the main statistical analyses ([GitHub](https://github.com/lnnrtwttkhn/highspeed-analysis), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-analysis), [DOI](https://doi.org/10.12751/g-node.eqqdtg)) - code for the behavioral task ([GitHub](https://github.com/lnnrtwttkhn/highspeed-task), [Zenodo](https://doi.org/10.5281/zenodo.4305888)) -- #### ... and the rest? > *"We [...] share all data listed in the Data availability section in modularized units alongside the code that created the data, usually in a dedicated `code` directory in each dataset, instead of separate data and code repositories."* > *"This approach allows to better establish the provenance of data (i.e., a better understanding which code and input data produced which output data), loosely following the **DataLad YODA principles** [...]*" --- # **Y**ODA's **O**rganigram on **D**ata **A**nalysis <sup>1</sup> #### P1: *"One thing, one dataset"* (**Modularity**) #### P2: *"Record where you got it from, and where it is now"* (**Provenance**) #### P3: *"Record what you did to it, and with what"* (**Reproducibility**) .footnote[ <sup>1</sup> a [recursive acronym](https://en.wikipedia.org/wiki/Recursive_acronym) ] -- </br> #### Learn about YODA, you must: - DataLad Handbook: "YODA: Best practices for data analyses in a dataset" (see [details](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html)) - "YODA: YODA's Organigram on Data Analysis" - Poster by Hanke et al., 2018, presented at the
24th Annual Meeting of the Organization for Human Brain Mapping (OHBM) 2018 - CC BY 4.0, [doi: 10.7490/f1000research.1116363.1](https://doi.org/10.7490/f1000research.1116363.1) - A short talk on YODA principles (see [details](https://github.com/myyoda/talk-principles)) → Details on YODA principles can also be found in the Appendix --- # Project website with main statistical results #### From Wittkuhn & Schuck, 2021, *Nature Communications* (see [Code Availability statement](https://www.nature.com/articles/s41467-021-21970-2#code-availability)): > "*We share all code used in this study. An overview of all the resources is publicly available on our **project website.**"* Project website publicly available at https://wittkuhn.mpib.berlin/highspeed/ --- # Outlook, challenges and discussions > *"He [Jon Claerbout] has also pointed out that we have reached a point where solutions are available - it is now possible to publish computational research that is really reproducible by others.*" Buckheit & Donoho (1995), describing how Jon Claerbout and his team shared CD-ROMs with interactive code that could regenerate the figures in their books - in the *early 90s* *(addition in brackets)* -- #### All technical solutions are available! The long-term challenges are: - moving towards a "culture of reproducibility" (cf. Russ Poldrack, see e.g., [this talk](https://www.youtube.com/watch?v=XjW3t-qXAiE)) - changing incentives / funding schemes - education, education, education - "slow science" --- # Challenge: Relationship between code and data - *"Which code produced which data?"* - *"In which order do I need to execute the code?"* -- #### Example solutions - [datalad run](http://docs.datalad.org/en/stable/generated/man/datalad-run.html) > `datalad run` *"[...]
will record a shell command, and save all changes this command triggered in the dataset – be that new files or changes to existing files."* (see [details](http://handbook.datalad.org/en/latest/basics/basics-run.html) in the DataLad handbook) - [GNU Make](https://www.gnu.org/software/make/) > *"Make enables [...] to build and install your package without knowing the details of how that is done -- because these details are recorded in the makefile that you supply."* > *"Make figures out automatically which files it needs to update, based on which source files have changed. It also automatically determines the proper order for updating files [...]"* --- # Challenge: Implementing a Data User Agreement (DUA) #### From Wittkuhn & Schuck, 2021, project website (see section on [license information](https://wittkuhn.mpib.berlin/highspeed/#license-information)): > If you download any of the published data, please complete our Data User Agreement (DUA). The Data User Agreement (DUA) we use for this study was taken from the Open Brain Consent project, distributed under Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0). - based on templates and recommendations of the [Open Brain Consent](https://open-brain-consent.readthedocs.io/en/stable/) project (licensed [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en)) - optional for data from Wittkuhn & Schuck, 2021 - it is not possible to implement a mandatory DUA on GIN --- # Challenge: Standardizing data organization <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://imgs.xkcd.com/comics/standards.png" alt="<a href="https://xkcd.com/927/" target="_blank">xkcd cartoon "Standards"</a>" width="60%" /> <p class="caption"><a href="https://xkcd.com/927/" target="_blank">xkcd cartoon "Standards"</a></p> </div> --- # Thank you!
--- class: title-slide, center, middle name: appendix-datalad-overview # Appendix: DataLad Overview <!--the next --- is a horizontal line--> --- --- # DataLad: What is a dataset? > *"A dataset is a directory on a computer that DataLad manages."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/dataset.svg" alt="see <a href="http://handbook.datalad.org/en/latest/basics/101-101-create.html" target="_blank">DataLad Handbook: Create a new dataset</a>" width="40%" /> <p class="caption">see <a href="http://handbook.datalad.org/en/latest/basics/101-101-create.html" target="_blank">DataLad Handbook: Create a new dataset</a></p> </div> > "*You can create new, empty datasets [...] and populate them, or transform existing directories into datasets.*" --- # DataLad: Version-control arbitrarily large files > *"Building on top of Git and git-annex, DataLad allows you to version control arbitrarily large files in datasets."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/local_wf.svg" alt="see <a href="http://handbook.datalad.org/en/latest/basics/101-102-populate.html" target="_blank">DataLad Handbook: How to populate a dataset</a>" width="40%" /> <p class="caption">see <a href="http://handbook.datalad.org/en/latest/basics/101-102-populate.html" target="_blank">DataLad Handbook: How to populate a dataset</a></p> </div> > *"[...]
keep track of revisions of data of any size, and view, interact with or restore any version of your dataset [...]."* --- # DataLad: Dataset consumption and collaboration > *"DataLad lets you consume datasets provided by others, and collaborate with them."* > *"You can **install existing datasets** and update them from their sources, or create sibling datasets that you can **publish updates** to and **pull updates** from for collaboration and data sharing."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/collaboration.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-105-install.html" target="_blank">DataLad Handbook: Install an existing dataset</a>" width="70%" /> <p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-105-install.html" target="_blank">DataLad Handbook: Install an existing dataset</a></p> </div> --- # DataLad: Dataset linkage > *"Datasets can contain other datasets (subdatasets), **nested arbitrarily deep.**"* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/linkage_subds1.svg" alt="see <a href="http://handbook.datalad.org/en/latest/basics/101-106-nesting.html" target="_blank">DataLad Handbook: Nesting datasets</a>" width="70%" /> <p class="caption">see <a href="http://handbook.datalad.org/en/latest/basics/101-106-nesting.html" target="_blank">DataLad Handbook: Nesting datasets</a></p> </div> > *"Each dataset has an independent [...] history, but can be registered at a precise version in higher-level datasets. 
This allows to **combine datasets** and to perform commands recursively across a hierarchy of datasets, and it is the basis for advanced provenance capture abilities."* --- # DataLad: Full provenance capture and reproducibility > *"DataLad allows to **capture full provenance**: The origin of datasets, the origin of files obtained from web sources, complete machine-readable and automatically reproducible records of how files were created (including software environments)."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/reproducible_execution.svg" alt="see <a href="http://handbook.datalad.org/en/latest/usecases/provenance_tracking.html" target="_blank">DataLad Handbook: Provenance tracking</a> and <a href="http://handbook.datalad.org/en/latest/basics/basics-run.html" target="_blank">run commands</a>" width="50%" /> <p class="caption">see <a href="http://handbook.datalad.org/en/latest/usecases/provenance_tracking.html" target="_blank">DataLad Handbook: Provenance tracking</a> and <a href="http://handbook.datalad.org/en/latest/basics/basics-run.html" target="_blank">run commands</a></p> </div> > *"You or your collaborators can thus re-obtain or reproducibly **recompute content with a single command**, and make use of extensive provenance of dataset content **(who created it, when, and how?)**."* --- # DataLad: Third party service integration > *"**Export datasets to third party services** such as GitHub, GitLab, or Figshare with built-in commands."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/thirdparty.svg" alt="see <a href="http://handbook.datalad.org/en/latest/basics/basics-thirdparty.html" target="_blank">DataLad Handbook: Third-party infrastructure</a>" width="60%" /> <p class="caption">see <a href="http://handbook.datalad.org/en/latest/basics/basics-thirdparty.html" target="_blank">DataLad Handbook: 
Third-party infrastructure</a></p> </div> > *"Alternatively, you can use a **multitude of other available third party services** such as Dropbox, Google Drive, Amazon S3, owncloud, or many more that DataLad datasets are compatible with."* --- # DataLad: Metadata handling > *"**Extract, aggregate, and query dataset metadata.** This allows to automatically obtain metadata according to different metadata standards (EXIF, XMP, ID3, BIDS, DICOM, NIfTI1, ...), store this metadata in a portable format, share it, and search dataset contents."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/metadata_prov_imaging.svg" alt="see <a href="http://docs.datalad.org/en/stable/metadata.html" target="_blank">DataLad Handbook: Metadata</a>" width="100%" /> <p class="caption">see <a href="http://docs.datalad.org/en/stable/metadata.html" target="_blank">DataLad Handbook: Metadata</a></p> </div> --- class: title-slide, center, middle name: appendix-datalad-yoda # Appendix: DataLad YODA principles <!--the next --- is a horizontal line--> --- --- # P1: *"One thing, one dataset"* - Structure study elements (data, code, results) in dedicated directories - Input data in `/inputs`, code in `/code`, results in `/outputs`, execution environments in `/envs` - Use dedicated projects for multiple different analyses <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://handbook.datalad.org/en/latest/_images/dataset_modules.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a>" width="60%" /> <p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a></p> </div> --- # P2: *"Record where you got it from, and where it is now"* - Record where 
the data came from, or how it is dependent on or linked to other data - Link re-usable data resource units as DataLad *subdatasets* - `datalad clone`, `datalad download-url`, `datalad save` .pull-left[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://handbook.datalad.org/en/latest/_images/data_origin.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a>" width="70%" /> <p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a></p> </div> ] .pull-right[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://handbook.datalad.org/en/latest/_images/decentralized_publishing.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a>" width="120%" /> <p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a></p> </div> ] --- # P3: *"Record what you did to it, and with what"* - For every file that was not obtained from elsewhere, know exactly how its content came to be - `datalad run` links input data with code execution to output data - `datalad containers-run` allows you to do the same *within* software containers (e.g., Docker or Singularity) <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://handbook.datalad.org/en/latest/_images/decentralized_publishing.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a>" width="50%" /> <p class="caption">see <a
href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a></p> </div> --- # DataLad: Resources, tutorials and teaching materials - The [DataLad Handbook](http://handbook.datalad.org/en/latest/) is an incredibly extensive resource - YouTube video: ["What is DataLad"](https://www.youtube.com/watch?v=IN0vowZ67vs) - YouTube video: Michael Hanke: ["How to introduce data management technology without sinking the ship?"](https://www.youtube.com/watch?v=uH75kYgwLH4)